Abstract

Information theoretic active learning has been widely studied for prob- 
abilistic models. For simple regression an optimal myopic policy is easily 
tractable. However, for other tasks and with more complex models, such 
as classification with nonparametric models, the optimal solution is harder 
to compute. Current approaches make approximations to achieve tractabil- 
ity. We propose an approach that expresses information gain in terms 
of predictive entropies, and apply this method to the Gaussian Process 
Classifier (GPC). Our approach makes minimal approximations to the full 
information theoretic objective. Our experimental performance compares 
favourably to many popular active learning algorithms, and has equal or 
lower computational complexity. We compare well to decision theoretic 
approaches also, which are privy to more information and require much 
more computational time. Secondly, by developing further a reformulation 
of binary preference learning to a classification problem, we extend our 
algorithm to Gaussian Process preference learning. 

1 Introduction 

In most machine learning systems, the learner passively collects data with which 
it makes inferences about its environment. In active learning, however, the 
learner seeks the most useful measurements to be trained upon. The goal of 
active learning is to produce the best model with the least possible data; this 
is closely related to the statistical field of optimal experimental design. With 
the advent of the internet and expansion of storage facilities, vast quantities 
of unlabelled data have become available, but it can be costly to obtain labels. 
Finding the most useful data in this vast space calls for efficient active learning 
algorithms. 

Two approaches to active learning are to use decision and information theory
[Kapoor et al., 2007] [Lindley, 1956]. The former minimizes the expected
losses encountered after making decisions based on the data collected, i.e. it
minimizes the Bayes posterior risk [Roy and McCallum, 2001]. Maximising
performance under test is the ultimate objective of most learners; however,
evaluating this objective can be very hard. For example, the methods proposed in
[Kapoor et al., 2007] [Zhu et al., 2003] for classification are in general
expensive to compute. Furthermore, one may not know the loss function or test
distribution in advance, or may want the model to perform well on a variety of
loss functions. In extreme scenarios, such as exploratory data analysis or
visualisation, losses may be very hard to quantify.

This motivates information theoretic approaches to active learning, which are
agnostic to the decision task at hand and to the particular test data; this is
known as an inductive approach. They seek to reduce the number of feasible
models as quickly as possible, using either heuristics (e.g. margin sampling
[Tong and Koller, 2001]) or by formalising uncertainty using well studied
quantities, such as Shannon's entropy and the KL-divergence
[Cover et al., 1991]. Although the latter approach was proposed several decades
ago [Lindley, 1956] [Bernardo, 1979], it is not always straightforward to apply
the criteria to complicated models such as nonparametric processes with infinite
parameter spaces. As a result many algorithms exist which compute approximate
posterior entropies, perform sampling, or work with related quantities in
non-probabilistic models.

We return to this problem, presenting the full information criterion and
demonstrating how to apply it to Gaussian Process Classification (GPC), yielding
a novel active learning algorithm that makes minimal approximations. GPC is a
powerful, non-parametric kernel-based model, and poses an interesting problem
for information-theoretic active learning because the parameter space is
infinite dimensional and the posterior distribution is analytically intractable.
We present the information theoretic approach to active learning in Section 2.
In Section 3 we apply it to GPC, and show how to extend our method to preference
learning. In Section 4 we review other approaches and how they compare to our
algorithm. We take particular care to contrast our approach with the Informative
Vector Machine, which addresses data point selection for GPs directly. We present
results on a wide variety of datasets in Section 5 and conclude in Section 6.



2 Bayesian Information Theoretic Active Learning

We consider a fully discriminative model where the goal of active learning is
to discover the dependence of some variable $y \in \mathcal{Y}$ on an input
variable $x \in \mathcal{X}$. The key idea in active learning is that the learner
chooses the input queries $x_i \in \mathcal{X}$ and observes the system's
response $y_i$, rather than passively receiving $(x_i, y_i)$ pairs.

Within a Bayesian framework we assume the existence of some latent parameters,
$\theta$, that control the dependence between inputs and outputs,
$p(y \mid x, \theta)$. Having observed data
$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, a posterior distribution over the
parameters is inferred, $p(\theta \mid \mathcal{D})$. The central goal of
information theoretic active learning is to reduce the number of possible
hypotheses maximally fast, i.e. to minimize the uncertainty about the parameters
using Shannon's entropy [Cover et al., 1991]. Data points $\mathcal{D}'$ are
selected that satisfy $\arg\min_{\mathcal{D}'} H[\theta \mid \mathcal{D}'] =
-\int p(\theta \mid \mathcal{D}') \log p(\theta \mid \mathcal{D}') \, d\theta$.
Solving this problem in general is NP-hard; however, as is common in sequential
decision making tasks, a myopic (greedy) approximation is made
[Heckerman et al., 1995]. It has been shown that the myopic policy can perform
near-optimally [Golovin and Krause, 2010] [Dasgupta, 2005]. Therefore, the
objective is to seek the data point $x$ that maximises the decrease in expected
posterior entropy:



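$$\arg\max_{x} \; H[\theta \mid \mathcal{D}] - \mathbb{E}_{y \sim p(y \mid x, \mathcal{D})} \big[ H[\theta \mid y, x, \mathcal{D}] \big] \qquad (1)$$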

Note that an expectation over the unseen output $y$ is required. Many works,
e.g. [MacKay, 1992] [Krishnapuram et al.] [Lawrence et al., 2003], propose using
this objective directly. However, parameter posteriors are often high
dimensional and computing their entropies is usually intractable. Furthermore,
for nonparametric processes the parameter space is infinite dimensional so
Eqn. (1) becomes poorly defined. To avoid gridding parameter space
(exponentially hard with dimensionality), or sampling (from which it is
notoriously hard to estimate entropies without introducing bias
[Panzeri and Petersen, 2007]), these papers make Gaussian or low dimensional
approximations and calculate the entropy of the approximate posterior. A second
computational difficulty arises: if $N_x$ data points are under consideration,
and $N_y$ responses may be seen, then $O(N_x N_y)$ potentially expensive
posterior updates are required to calculate Eqn. (1).

An important insight arises if we note that the objective in Eqn. (1) is
equivalent to the conditional mutual information between the unknown output
and the parameters, $I[\theta, y \mid x, \mathcal{D}]$. Using this insight it is
simple to show that the objective can be rearranged to compute entropies in $y$
space:



$$\arg\max_{x} \; H[y \mid x, \mathcal{D}] - \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})} \big[ H[y \mid x, \theta] \big] \qquad (2)$$
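The rearrangement can be sketched in two lines using the symmetry of mutual information. This is a standard identity written in the notation of this section, not a reproduction of the authors' own derivation; it assumes that $y$ depends on $\mathcal{D}$ only through $\theta$, so that $H[y \mid x, \theta, \mathcal{D}] = H[y \mid x, \theta]$.

$$
\begin{aligned}
I[\theta, y \mid x, \mathcal{D}]
  &= H[\theta \mid \mathcal{D}] - \mathbb{E}_{y \sim p(y \mid x, \mathcal{D})} \big[ H[\theta \mid y, x, \mathcal{D}] \big] \\
  &= H[y \mid x, \mathcal{D}] - \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})} \big[ H[y \mid x, \theta] \big]
\end{aligned}
$$

The first line is the objective of Eqn. (1) and the second is the output-space form, so both are maximised by the same query $x$.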

Eqn. (2) overcomes the challenges we described for Eqn. (1). Entropies are now
calculated in the, usually low dimensional, output space. For binary
classification, these are just entropies of Bernoulli variables. Also $\theta$
is now conditioned only on $\mathcal{D}$, so only $O(1)$ posterior updates are
required. Eqn. (2) also provides us with an interesting intuition about the
objective: we seek the $x$ for which the model is marginally most uncertain
about $y$ (high $H[y \mid x, \mathcal{D}]$), but for which individual settings
of the parameters are confident (low
$\mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}[H[y \mid x, \theta]]$).
This can be interpreted as seeking the $x$ for which the parameters under the
posterior disagree about the outcome the most, so we refer to this objective as
Bayesian Active Learning by Disagreement (BALD). We present a method to
apply Eqn. (2) directly to GPC and preference learning. We no longer need to
build our entropy calculation around the type of posterior approximation (as
in [MacKay, 1992] [Krishnapuram et al.] [Lawrence et al., 2003]) but are free to
choose from many of the available algorithms. Minimal additional approximations
are introduced, and so, to our knowledge, our algorithm represents the most
exact and fastest way to perform full information-theoretic active learning in
non-parametric discriminative models.
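As an illustration of Eqn. (2) for binary outputs, the following is a minimal sketch that estimates the objective from posterior samples. It assumes that predictive probabilities $p(y = 1 \mid x, \theta_k)$ are already available for samples $\theta_k \sim p(\theta \mid \mathcal{D})$; the names `binary_entropy`, `bald_score` and `select_query` are hypothetical and entropies are measured in bits. This sampling estimate is a generic stand-in, not the closed-form GPC method derived in Section 3.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable (0 at p = 0 or p = 1)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bald_score(sample_probs):
    """Sample estimate of Eqn. (2) for one candidate x.

    sample_probs[k] is p(y=1 | x, theta_k) for theta_k drawn from the posterior.
    """
    n = len(sample_probs)
    # First term: entropy of the marginal (posterior-averaged) prediction.
    marginal_entropy = binary_entropy(sum(sample_probs) / n)
    # Second term: average entropy of the individual parameter settings.
    expected_entropy = sum(binary_entropy(p) for p in sample_probs) / n
    return marginal_entropy - expected_entropy


def select_query(candidates):
    """Return the candidate whose sampled predictions disagree the most.

    candidates maps each candidate x to its list of sampled probabilities.
    """
    return max(candidates, key=lambda x: bald_score(candidates[x]))
```

A candidate on which all samples agree scores zero even if every sample predicts 0.5, whereas a candidate on which confident samples contradict each other scores highly.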



3 Gaussian Processes for Classification and Preference Learning

In this section we derive the BALD algorithm for Gaussian Process classification
(GPC). GPs are a powerful and popular non-parametric tool for regression
and classification. GPC appears to be an especially challenging problem for
information-theoretic active learning because the parameter space is infinite;
however, by using Eqn. (2) we are able to calculate fully the relevant
information quantities without having to work out entropies of infinite
dimensional objects. The probabilistic model underlying GPC is as follows:

$$f \sim \mathrm{GP}(\mu(\cdot), k(\cdot, \cdot))$$

$$y \mid x, f \sim \mathrm{Bernoulli}(\Phi(f(x)))$$

The latent parameter, now called $f$, is a function $\mathcal{X} \to \mathbb{R}$,
and is assigned a Gaussian process prior with mean $\mu(\cdot)$ and covariance
function or kernel $k(\cdot, \cdot)$. We consider the probit case where, given
the value of $f$, $y$ takes a Bernoulli distribution with probability
$\Phi(f(x))$, and $\Phi$ is the Gaussian CDF. For further details on GPs see
[Rasmussen and Williams, 2005].

Inference in the GPC model is intractable; given some observations
$\mathcal{D}$, the posterior over $f$ becomes non-Gaussian and complicated. The
most commonly used approximate inference methods - EP, Laplace approximation,
Assumed Density Filtering and sparse methods - all approximate the posterior by
a Gaussian [Rasmussen and Williams, 2005]. Throughout this section we will
assume that we are provided with such a Gaussian approximation from one
of these methods, though the active learning algorithm does not care which
one. In our derivation we will use $\overset{1}{\approx}$ to indicate where such
an approximation is exploited.

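The informativeness of a query $x$ is computed using Eqn. (2). The entropy
of the binary output variable $y$ given a fixed $f$ can be expressed in terms of
the binary entropy function $h$:

$$H[y \mid x, f] = h(\Phi(f(x)))$$

$$h(p) = -p \log p - (1 - p) \log(1 - p)$$

Expectations over the posterior need to be computed. Using a Gaussian
approximation to the posterior, for each $x$, $f_x = f(x)$ will follow a
Gaussian distribution with mean $\mu_{x,\mathcal{D}}$ and variance
$\sigma^2_{x,\mathcal{D}}$. To compute Eqn. (2) we have to compute two entropy
quantities. The first term in Eqn. (2), $H[y \mid x, \mathcal{D}]$, can be
handled analytically for the probit case:

$$H[y \mid x, \mathcal{D}] \overset{1}{\approx} h\left( \int \Phi(f_x) \, \mathcal{N}(f_x \mid \mu_{x,\mathcal{D}}, \sigma^2_{x,\mathcal{D}}) \, df_x \right) = h\left( \Phi\left( \frac{\mu_{x,\mathcal{D}}}{\sqrt{\sigma^2_{x,\mathcal{D}} + 1}} \right) \right) \qquad (3)$$

The second term, $\mathbb{E}_{f \sim p(f \mid \mathcal{D})}[H[y \mid x, f]]$,
can be computed approximately as follows:

$$
\begin{aligned}
\mathbb{E}_{f \sim p(f \mid \mathcal{D})} \big[ H[y \mid x, f] \big]
  &\overset{1}{\approx} \int h(\Phi(f_x)) \, \mathcal{N}(f_x \mid \mu_{x,\mathcal{D}}, \sigma^2_{x,\mathcal{D}}) \, df_x \\
  &\overset{2}{\approx} \int \exp\left( -\frac{f_x^2}{\pi \ln 2} \right) \mathcal{N}(f_x \mid \mu_{x,\mathcal{D}}, \sigma^2_{x,\mathcal{D}}) \, df_x \\
  &= \frac{C}{\sqrt{\sigma^2_{x,\mathcal{D}} + C^2}} \exp\left( -\frac{\mu^2_{x,\mathcal{D}}}{2 (\sigma^2_{x,\mathcal{D}} + C^2)} \right)
\end{aligned}
\qquad (4)
$$

where $C = \sqrt{\pi \ln 2 / 2}$. The first approximation,
$\overset{1}{\approx}$, reflects the Gaussian approximation to the posterior.
The first integral in Eqn. (4) is intractable. By performing a Taylor expansion
on $\ln h(\Phi(f_x))$ (see supplementary material) we can see that it can be
approximated up to $O(f_x^4)$ by a squared exponential curve,
$\exp(-f_x^2 / \pi \ln 2)$. We will refer to this approximation as
$\overset{2}{\approx}$. Now we can apply the standard convolution formula for
Gaussians to finally get a closed form expression for both terms of Eqn. (2).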

Fig. 1 depicts the striking accuracy of this simple approximation. The maximum
possible error that will be incurred when using this approximation is if
$\mathcal{N}(f_x \mid \mu_{x,\mathcal{D}}, \sigma^2_{x,\mathcal{D}})$ is centred
at $\mu_{x,\mathcal{D}} = \pm 2.05$ with $\sigma^2_{x,\mathcal{D}}$ tending to
zero (see Fig. 1, absolute error), yielding only a 0.27% error in the integral
in Eqn. (4). The authors are unaware of previous use of this simple and useful
approximation in this context. In Section 5 we investigate experimentally the
information lost from approximations $\overset{1}{\approx}$ and
$\overset{2}{\approx}$ as compared to the gold standard of extensive Monte Carlo
simulation.

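To summarise, the BALD algorithm for Gaussian process classification consists
of two steps. First it applies any standard approximate inference algorithm
for GPCs (such as EP) to obtain the posterior predictive mean
$\mu_{x,\mathcal{D}}$ and variance $\sigma^2_{x,\mathcal{D}}$ for each point of
interest $x$. Then, it selects a query $x$ that maximises the following
objective function, obtained by combining Eqns. (3) and (4):

$$\arg\max_{x} \; h\left( \Phi\left( \frac{\mu_{x,\mathcal{D}}}{\sqrt{\sigma^2_{x,\mathcal{D}} + 1}} \right) \right) - \frac{C}{\sqrt{\sigma^2_{x,\mathcal{D}} + C^2}} \exp\left( -\frac{\mu^2_{x,\mathcal{D}}}{2 (\sigma^2_{x,\mathcal{D}} + C^2)} \right) \qquad (5)$$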




Figure 1: Analytic approximation ($\overset{2}{\approx}$) to the binary entropy
of the error function by a squared exponential. The absolute error remains under
$3 \cdot 10^{-3}$.

For most practically relevant kernels, the objective (5) is a smooth and
differentiable function of $x$, so gradient-based optimisation procedures can be
used to find the maximally informative query.
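The closed-form objective for GPC can be sketched in a few lines and checked against direct numerical integration. The code below assumes that the Gaussian posterior marginal of $f_x$ (mean `mu`, variance `var`) has already been produced by some approximate inference method, and that the binary entropy is measured in bits, which is the convention under which the squared exponential $\exp(-f_x^2 / \pi \ln 2)$ matches it. The function names and the quadrature settings are hypothetical choices for illustration.

```python
import math

# C = sqrt(pi * ln 2 / 2), so that exp(-f^2 / (pi ln 2)) = exp(-f^2 / (2 C^2)).
C = math.sqrt(math.pi * math.log(2.0) / 2.0)


def norm_cdf(z):
    """Standard Gaussian CDF, Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bald_gpc(mu, var):
    """Closed-form approximate objective from the posterior marginal N(mu, var)."""
    marginal = binary_entropy(norm_cdf(mu / math.sqrt(var + 1.0)))
    expected = (C / math.sqrt(var + C ** 2)
                * math.exp(-mu ** 2 / (2.0 * (var + C ** 2))))
    return marginal - expected


def bald_gpc_numeric(mu, var, n=4001, width=8.0):
    """Reference value: trapezoid rule on the integral of h(Phi(f)) N(f | mu, var)."""
    sd = math.sqrt(var)
    lo = mu - width * sd
    step = 2.0 * width * sd / (n - 1)
    total = 0.0
    for i in range(n):
        f = lo + i * step
        weight = 0.5 if i in (0, n - 1) else 1.0
        density = math.exp(-0.5 * ((f - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        total += weight * binary_entropy(norm_cdf(f)) * density * step
    return binary_entropy(norm_cdf(mu / math.sqrt(var + 1.0))) - total
```

In a pool-based setting one would evaluate `bald_gpc` on the predictive mean and variance of every candidate and query the maximiser; for continuous inputs the same expression could be handed to a gradient-based optimiser.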

3.1 Extension: Learning Hyperparameters 

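In many applications the parameter set naturally divides into parameters of
interest, $\theta^+$, and nuisance parameters, $\theta^-$, i.e.
$\theta = \{\theta^+, \theta^-\}$. In such settings, the active learning
algorithm may want to query points that are maximally informative about
$\theta^+$, while not caring about $\theta^-$. By integrating Eqn. (1) over the
nuisance parameters, $\theta^-$, BALD's objective is re-derived as:

$$H\big[ \mathbb{E}_{p(\theta^+, \theta^- \mid \mathcal{D})} [y \mid x, \theta^+, \theta^-] \big] - \mathbb{E}_{p(\theta^+ \mid \mathcal{D})} \Big[ H\big[ \mathbb{E}_{p(\theta^- \mid \theta^+, \mathcal{D})} [y \mid x, \theta^+, \theta^-] \big] \Big] \qquad (6)$$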

In the context of GP models, hyperparameters typically control the smoothness
or spatial length-scale of functions. If we maintain a posterior distribution
over these hyperparameters, which we can do e.g. via Hamiltonian Monte Carlo,
we can choose either to treat them as nuisance parameters $\theta^-$ and use
Eqn. (6), or to include them in $\theta^+$ and perform active learning over them
as well. In certain cases, such as automatic relevance determination
[Rasmussen and Williams, 2005], it may even make sense to treat hyperparameters
as variables of primary interest, and the function $f$ itself as a nuisance
parameter $\theta^-$.
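A sampling-based sketch of Eqn. (6) may help to show what marginalising the nuisance parameters does. It assumes a nested set of posterior samples in which `probs[i][j]` holds $p(y = 1 \mid x, \theta^+_i, \theta^-_{ij})$, with $\theta^+_i \sim p(\theta^+ \mid \mathcal{D})$ and $\theta^-_{ij} \sim p(\theta^- \mid \theta^+_i, \mathcal{D})$; this data layout and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def bald_with_nuisance(probs):
    """Sample estimate of Eqn. (6) for one candidate x.

    probs[i][j] = p(y=1 | x, theta_plus_i, theta_minus_ij).
    """
    # Average out the nuisance parameters for each sample of theta_plus.
    per_interest = [sum(row) / len(row) for row in probs]
    # First term: entropy of the fully marginalised prediction.
    overall = sum(per_interest) / len(per_interest)
    # Second term: expected entropy once only theta_plus is known.
    expected = sum(binary_entropy(p) for p in per_interest) / len(per_interest)
    return binary_entropy(overall) - expected
```

Disagreement that is caused only by the nuisance parameters is averaged away before the entropy is taken, so it contributes nothing to the score; only disagreement between settings of $\theta^+$ is rewarded.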

3.2 Preference Learning 

Our active learning framework for GPC can be extended to the important problem
of preference learning [Fürnkranz and Hüllermeier, 2003]
[Chu and Ghahramani, 2005]. In preference learning the dataset consists of pairs
of items $(u_i, v_i) \in \mathcal{X}^2$ with binary labels, $y_i \in \{0, 1\}$.
$y_i = 1$ means instance $u_i$ is preferred to $v_i$, denoted $u_i \succ v_i$.
The task is to predict the preference relation between any $(u, v)$.
We can view this as a special case of building a classifier on pairs of inputs
$h : \mathcal{X}^2 \to \{0, 1\}$. [Chu and Ghahramani, 2005] propose a Bayesian
approach, using a latent preference function $f$, over which a GP prior is
defined. The model predicts preference $u_i \succ v_i$ whenever
$f(u_i) + \epsilon_{u_i} > f(v_i) + \epsilon_{v_i}$, where
$\epsilon_{u_i}, \epsilon_{v_i}$ denote additive Gaussian noise. Under this
model, the likelihood of $f$ becomes:

$$P[y = 1 \mid (u_i, v_i), f] = P[u_i \succ v_i \mid f] = \Phi\left( \frac{f(u_i) - f(v_i)}{\sqrt{2} \sigma_{\mathrm{noise}}} \right)$$

By rescaling the latent function $f$, it can be assumed w.l.o.g. that
$\sqrt{2} \sigma_{\mathrm{noise}} = 1$. The likelihood only depends on the
difference between $f(u)$ and $f(v)$. We therefore define
$g(u, v) = f(u) - f(v)$, and do inference entirely in terms of $g$, for which
the likelihood becomes the same as for probit classification:
$y \mid u, v, f \sim \mathrm{Bernoulli}(\Phi(g(u, v)))$. We observe that a GP
prior is induced on $g$ because it is formed by performing a linear operation on
$f$, for which we have a GP prior already, $f \sim \mathrm{GP}(0, k)$. We can
derive the induced covariance function of $g$ (derivation in the Supplementary
material) as:
$k_{\mathrm{pref}}((u_i, v_i), (u_j, v_j)) = k(u_i, u_j) + k(v_i, v_j) - k(u_i, v_j) - k(v_i, u_j)$.

Note that this kernel $k_{\mathrm{pref}}$ respects the anti-symmetry properties
desired for a preference learning scenario, i.e. the value $g(u, v)$ is
perfectly anti-correlated with $g(v, u)$, ensuring
$P[u \succ v] = 1 - P[v \succ u]$ holds. Thus, we can conclude that the GP
preference learning framework of [Chu and Ghahramani, 2005] is equivalent to GPC
with a particular class of kernels, which we may call the preference judgement
kernels. Therefore, our active learning algorithm presented in Section 3 for GPC
can readily be applied to pairwise preference learning also.
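The preference judgement kernel is straightforward to build on top of any base kernel. The sketch below uses a squared exponential base kernel purely as an illustrative assumption (the text does not fix a choice of $k$), and the names `rbf` and `k_pref` are hypothetical.

```python
import math


def rbf(a, b, lengthscale=1.0):
    """Squared exponential base kernel on equal-length tuples of floats."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-0.5 * sq_dist / lengthscale ** 2)


def k_pref(pair_i, pair_j, k=rbf):
    """Covariance of g(u_i, v_i) and g(u_j, v_j), where g(u, v) = f(u) - f(v)."""
    (u_i, v_i), (u_j, v_j) = pair_i, pair_j
    return k(u_i, u_j) + k(v_i, v_j) - k(u_i, v_j) - k(v_i, u_j)
```

Swapping the two items of one pair flips the sign of the covariance, which is the anti-symmetry noted above, and a pair consisting of the same item twice has zero covariance with every other pair.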

4 Related Methodologies 

There are a number of closely related algorithms for active classification which 
we now review. 

The Informative Vector Machine (IVM): Perhaps the most closely related
approach is the IVM [Lawrence et al., 2003]. This popular and successful
approach to active learning was designed specifically for GPs; it uses an
information theoretic approach and so appears very similar to BALD. The IVM
algorithm was designed for subsampling a dataset for training a GP, so it is
privy to the $y$ values before including a measurement; it cannot therefore work
explicitly in output space, i.e. with Eqn. (2). The IVM uses Eqn. (1), but
parameter entropies are calculated approximately in the marginal subspace
corresponding to the observed data points. The entropy decrease after inclusion
of a new data point can then be calculated efficiently using the GP covariance
matrix.

Although the IVM and BALD are motivated by the same objective, they work
fundamentally differently when approximate inference is carried out. At any time
both methods have an approximate posterior $q_t(\theta \mid \mathcal{D})$; this
can be updated with the likelihood of a new data point
$p(y_{t+1} \mid f, x_{t+1})$, yielding
$p_{t+1}(\theta \mid \mathcal{D}, x_{t+1}, y_{t+1}) = \frac{1}{Z} q_t(\theta \mid \mathcal{D}) \, p(y_{t+1} \mid f, x_{t+1})$.
If the posterior at $t + 1$ is approximated directly one gets
$q_{t+1}(\theta \mid \mathcal{D}, x_{t+1}, y_{t+1})$. BALD calculates the
entropy difference between $q_t$ and $p_{t+1}$, without having to compute
$q_{t+1}$ for each candidate $x$. In contrast, the IVM calculates the entropy
change between $q_t$ and $q_{t+1}$. The IVM's approach cannot calculate the
entropy of the full infinite dimensional posterior, and requires $O(N_x N_y)$
posterior updates. To do these updates efficiently, approximate inference is
performed using Assumed Density Filtering (ADF). Using ADF means that $q_{t+1}$
is a direct approximation to $p_{t+1}$, indicating that the IVM makes a further
approximation to BALD. Since BALD only requires $O(1)$ posterior updates it can
afford to use more accurate, iterative procedures, such as EP.



Information Theoretic approaches: Maximum Entropy Sampling (MES)
[Sebastiani and Wynn, 2000] explicitly works in dataspace (Eqn. (2)). MES was
proposed for regression models with input-independent observation noise.
Although Eqn. (2) is used, the second term is constant because of the
input-independent noise and is ignored. One cannot, however, use MES for
heteroscedastic regression or classification; it fails to differentiate between
model uncertainty and observation uncertainty (about which our model may be
confident). Some toy demonstrations show this 'information based' active
learning criterion performing pathologically in classification by repeatedly
querying points close to the decision boundary or in regions of high observation
uncertainty, e.g. [Huang et al., 2010]. This is because MES is inappropriate in
this domain; BALD distinguishes between observation and model uncertainty and
eliminates these problems, as we will show.
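The difference between the two criteria can be seen on a toy example. The sketch below contrasts the first term of Eqn. (2) alone (the MES score) with the full objective, using invented sampled predictive probabilities for two hypothetical candidates: one where every posterior sample predicts 0.5 (pure observation noise) and one where confident samples contradict each other (model uncertainty).

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) variable."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def mes_score(sample_probs):
    """First term of Eqn. (2) only: entropy of the marginal prediction."""
    return binary_entropy(sum(sample_probs) / len(sample_probs))


def bald_score(sample_probs):
    """Full Eqn. (2): marginal entropy minus expected conditional entropy."""
    expected = sum(binary_entropy(p) for p in sample_probs) / len(sample_probs)
    return mes_score(sample_probs) - expected


# Illustrative values, not taken from the experiments.
noisy_point = [0.5, 0.5, 0.5, 0.5]          # all samples agree: label is a coin flip
contested_point = [0.05, 0.95, 0.05, 0.95]  # samples are confident but disagree
```

Both candidates receive essentially the same MES score of one bit, so MES cannot tell them apart, while the full objective assigns zero to the noisy point and a large value to the contested one.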

Mutual-information based objective functions are presented in [Ertin et al.]
[Fuhrmann, 2003]. They maximise the mutual information between the variable
being measured and the variable of interest. Fuhrmann [Fuhrmann, 2003] applies
this to linear Gaussian models and acoustic arrays, Ertin et al. [Ertin et al.]
to a communications channel. Although related, these objectives do not work with
the model parameters and are not applied to classification.
[Guestrin et al., 2005] [Krause et al., 2006] also use mutual information. They
specify interest points in advance and maximise the expected mutual information
between the predictive distributions at these points and at the observed
locations. Although this objective is promising for regression, it is not
tractable for models with input-dependent observation noise, such as
classification or preference learning.



Decision theoretic: We briefly mention decision theoretic approaches to active
learning. Two closely related algorithms, [Kapoor et al., 2007]
[Zhu et al., 2003], seek to minimize the expected cost, i.e. the loss weighted
misclassification probability on all seen and future data. These methods observe
the locations of the test points and their objective functions become monotonic
in the predictive entropies at the test points. [Kapoor et al., 2007] also
includes an empirical error term that can yield pathological behaviour (we
investigate this experimentally). These approaches are computationally
expensive, requiring $O(N_x N_y)$ posterior updates. Also, they must know the
locations of the test data (and thus are transductive approaches); designing an
inductive, decision-theoretic algorithm is an open, hard problem as it would
require expensive integration over possible test data distributions.

Figure 2: Percentage approximation error (±1 s.d.) for different methods of
approximate inference (columns) and approximation methods for evaluating
Eqn. (4) (rows). The results indicate that $\overset{2}{\approx}$ is a very
accurate approximation; EP causes some loss and Laplace significantly more,
which is in line with the comparison presented in [Kuss and Rasmussen, 2005].
For our experiments we use EP.

Non-probabilistic: Some non-probabilistic methods have close analogues to
information theoretic active learning. Perhaps the most ubiquitous is active
learning for SVMs [Tong and Koller, 2001] [Seung et al., 1992], where the volume
of Version Space (VS) is used as a proxy for the posterior entropy. If a uniform
(improper) prior is used with a deterministic classification likelihood, the log
volume of VS and the Bayesian posterior entropy are in fact equivalent. Just as
Bayesian posteriors become intractable after observing many data points, VS can
become complicated. [Tong and Koller, 2001] propose methods for approximating
VS with simple shapes, such as hyperspheres (their simplest approximation
reduces to margin sampling). This closely resembles approximating a Bayesian
posterior using a Gaussian distribution via the Laplace or EP approximations.
[Seung et al., 1992] sidesteps the problem by working with predictions. The
algorithm, Query by Committee (QBC), samples parameters from VS (committee
members), which then vote on the outcome of each possible $x$. The $x$ with the
most balanced vote is selected; this is termed the 'principle of maximal
disagreement'. If BALD is used with a sampled posterior, query by committee is
implemented but with a probabilistic measure of disagreement. QBC's
deterministic vote criterion discards confidence in the predictions and so can
exhibit the same pathologies as MES.

5 Experiments 

Figure 3: Top: Evaluation on artificial datasets. The two classes are shown with
black squares and red circles. Bottom: Results of active learning with nine
methods: random query, BALD, MES, QBC with 2 and 100 committee members, active
SVM, IVM, decision theoretic [Kapoor et al., 2007], decision theoretic
[Zhu et al., 2003] and empirical error.

Quantifying Approximation Losses: To obtain the approximate objective for GPC
we made two approximations: we perform approximate inference
($\overset{1}{\approx}$), and we approximated the binary entropy of the Gaussian
CDF by a squared exponential ($\overset{2}{\approx}$). Both of these can be
substituted with Monte Carlo sampling, enabling us to compute an asymptotically
unbiased estimate of the expected information gain. Using extensive Monte Carlo
as the 'gold standard', we can evaluate how much we lose by applying these
approximations. We quantify approximation error as:



$$\frac{\max_{x \in \mathcal{P}} I(x) - I\big( \arg\max_{x \in \mathcal{P}} \tilde{I}(x) \big)}{\max_{x \in \mathcal{P}} I(x)} \cdot 100\%$$



where $I$ is the objective computed using Monte Carlo and $\tilde{I}$ is the
approximate objective. The cancer UCI dataset was used; results and discussion
are in Fig. 2.
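The error measure above can be written as a short function. This is a sketch under the assumption that both objectives have been evaluated on the same list of candidate points; the function name and the example values are hypothetical.

```python
def approximation_error_percent(exact, approx):
    """Relative information lost by querying the maximiser of the approximation.

    exact[i]  : Monte Carlo ("gold standard") objective at candidate i
    approx[i] : approximate objective at the same candidate
    """
    best_exact = max(exact)
    # Index of the point that the approximate objective would choose.
    chosen = max(range(len(approx)), key=lambda i: approx[i])
    return 100.0 * (best_exact - exact[chosen]) / best_exact
```

For example, with `exact = [0.2, 0.5, 0.4]` and `approx = [0.1, 0.3, 0.6]` the approximation selects the third candidate, whose exact value is 0.4 against a best possible 0.5, giving an error of about 20%. The measure is zero whenever both objectives pick the same point, regardless of how much their values differ elsewhere.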

Figure 4: Test set classification accuracy on classification and preference
learning datasets. Methods used are BALD, random query, MES, QBC with 2 (QBC2)
and 100 (QBC100) committee members, active SVM, IVM, decision theoretic
[Kapoor et al., 2007], decision theoretic [Zhu et al., 2003] and empirical
error. The decision theoretic methods took a long time to run, so were not
completed for all datasets. Plots (a-i) are GPC datasets, (j-l) are preference
learning.

Figure 5: Summary of results for all classification experiments. The y-axis
denotes the number of additional data points, relative to BALD, required to
achieve at least 97.5% of the predictive performance of the entire pool. The
'box' denotes the 25th to 75th percentile, the red line denotes the median over
datasets, and the 'whiskers' depict the range. The crosses denote outliers
(> 2.7σ from the mean). Positive values mean that the algorithm required more
data points than BALD to achieve the same performance.

Pool based active learning: We test BALD for GPC and preference learning
in the pool-based setting, i.e. selecting $x$ values from a fixed set of data
points. Although BALD can generalise to selecting continuous $x$, this enables
us to compare to algorithms that cannot. We compare to eight other algorithms:
random sampling, MES, QBC (with 2 and 100 committee members), SVM with version
space approximation [Tong and Koller, 2001], decision theoretic approaches in
[Kapoor et al., 2007] [Zhu et al., 2003] and directly minimizing expected
empirical error (the last is not a widely used method, but is included for
analysis of [Kapoor et al., 2007]).
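The pool-based protocol itself is independent of the selection criterion and can be sketched as a generic loop. The callables `oracle` (returns the label of a queried point), `fit` (performs inference on the labelled data) and `score` (evaluates the acquisition objective for one candidate) are hypothetical placeholders; this is not the experimental code used for the results reported here.

```python
def pool_based_active_learning(pool, oracle, fit, score, n_queries):
    """Greedy (myopic) pool-based loop: refit, score every candidate, query the best.

    pool      : iterable of candidate inputs
    oracle    : callable x -> label, queried once per selected point
    fit       : callable labelled_pairs -> model
    score     : callable (model, x) -> acquisition value, e.g. a BALD objective
    n_queries : number of labels to request
    """
    remaining = list(pool)
    labelled = []
    for _ in range(min(n_queries, len(remaining))):
        model = fit(labelled)  # one posterior update per query round
        best = max(remaining, key=lambda x: score(model, x))
        remaining.remove(best)
        labelled.append((best, oracle(best)))
    return labelled
```

Plugging a different `score` into the same loop gives the other selection rules compared in this section, for example a constant score with random tie-breaking for random sampling.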

We consider three artificial, but challenging, datasets. The first, block
in the middle, has a block of noisy points on the decision boundary. The second,
block in the corner, has a block of uninformative points far from the decision
boundary: a strong active learning algorithm should avoid these uninformative
regions. The third is similar to the checkerboard dataset in [Zhu et al., 2003],
and is designed to test the algorithm's capabilities to find multiple disjoint
islands of points from one class. The three datasets and results using each
algorithm are depicted in Fig. 3.

Results are also presented on eight UCI classification datasets: australia,
crabs, vehicle, isolet, cancer, wine, wdbc and letter. Letter is a multiclass
dataset for which we select hard-to-distinguish letters E vs. F and D vs. P. For
preference learning we use the cpu, cart and kinematics regression datasets¹
processed to yield a preference task as described in
[Chu and Ghahramani, 2005]. Results are plotted in Fig. 4 and Fig. 5 depicts an
aggregation of the results.

Discussion: Figs. 3 and 4 show that by using BALD we make significant gains
over naive random sampling in both the classification and preference learning
domains. Relative to other active learning algorithms BALD is consistently the
best, or amongst the best performing algorithms on all datasets. On any
individual dataset BALD's performance is often matched because we compare to
many methods, and the more approximate algorithms can have good performance
under different conditions. Fig. 5 reveals that BALD has the best overall
performance; on average, all other methods require more data points to achieve
the same classification accuracy. Zhu et al.'s decision theoretic approach is
closest: the median increase in the number of data points required is 1.4 and
zero (i.e. equivalent to BALD) is within the inter-quartile range. This
algorithm, however, requires much more computational time and has access to the
full set of test inputs, which BALD does not have. MES and QBC appear close in
performance to BALD, but the zero line falls outside both of their
inter-quartile ranges.

¹ http://www.liacc.up.pt/~ltorgo/Regression/DataSets.html

As expected, MES performs poorly on the noisy dataset (Fig. 3(a)) because
it discards knowledge of observation noise. When there is zero observation noise
it is equivalent to BALD, e.g. Fig. 3(c). On many of the real-world datasets MES
performs as well as BALD, e.g. Fig. 4(b, e), indicating that these datasets are
mostly noise-free.

The IVM performs well on Fig. 3(c), but pathologically on Fig. 3(a); this is due
to the fact that it biases selection towards points from only one class in the
noisy cluster, reducing the posterior entropy rapidly but artificially. However,
it also performs significantly worse than BALD on noise-free (indicated by
MES's strong performance) datasets, e.g. Fig. 4(b). This implies that the IVM's
posterior approximation or the ADF update are detrimental to the algorithm's
performance.

QBC often yields only a small decrement in performance; the sampling
approximation is often not too detrimental. However, it performs poorly on the
noisy artificial dataset (Fig. 3(a)) because, like MES, the vote criterion does
not maintain a notion of inherent uncertainty. The SVM-based approach exhibits
variable performance (it does well on Fig. 4(d), but very poorly on Fig. 4(f)).
The performance is greatly affected by the approximation used; for consistency
we present here the one that yielded the most consistently good performance.

Decision theoretic approaches sometimes perform well: on Fig. 3(c) they choose
the first 16 points from the centre of each cluster as they are influenced by
the surrounding unlabelled points. BALD does not observe the unlabelled points
so may not pick points from the centres. Fig. 5 reveals that BALD is performing
as well as the method in [Zhu et al., 2003], and outperforms the approach in
[Kapoor et al., 2007], despite not having access to the locations of the test
points and having a significantly lower computational cost. The objective in
[Kapoor et al., 2007] can fail because one term in their objective function
is the empirical error. The weight given to this term is determined by the
relative sizes of the training and test set (and the associated losses).
Directly minimizing empirical error usually performs very pathologically,
picking only 'safe' points. When the method in [Kapoor et al., 2007] assigns too
much weight to this term, it can fail also.

Finally we note that BALD may occasionally perform poorly on the first few
data points (e.g. Fig. 4(l)). This may be because the hyperparameters are
fixed throughout the experiments to provide a fair comparison to algorithms
incapable of incorporating hyperparameter learning. This may mean that given
little data the GP model overfits, leading to BALD selecting abnormal query
locations. Maintaining a distribution over hyperparameters can be done using
MCMC, although this significantly increases computational time. Designing a
general method to do this efficiently is a subject of further work. In practice,
a simple heuristic such as picking the first few points randomly, and then
optimising hyperparameters, will usually suffice.

6 Conclusions 

We have demonstrated a method that applies the full information theoretic active
learning criterion to GP classification that makes, as far as the authors are
aware, the smallest number of approximations to date, and has as good
computational complexity. We extend the GPC model to develop a new preference
learning kernel, which enables us to apply our active learning algorithm
directly to this domain also. The method can naturally handle active learning of
kernel hyperparameters, which is a hard, mostly unsolved problem, for example in
SVM active learning. One notable feature of our approach is that it is agnostic
to the approximate inference method used. This allows us to choose from
a whole range of approximate inference methods, including EP, the Laplace
approximation, ADF or even sparse online learning, and thereby make the
trade-off between computational complexity and accuracy. Our experimental
performance compares favourably to many other active learning methods for
classification, and even to decision theoretic methods that have access to the
test data and require much greater computational time.
APPENDIX - SUPPLEMENTARY MATERIAL

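Taylor Expansion for Approximation $\overset{2}{\approx}$

We perform a Taylor expansion on $\ln H[\Phi(x)]$ as follows:

$$f(x) = f(0) + \frac{f'(0) x}{1!} + \frac{f''(0) x^2}{2!} + \dots$$

$$f(x) = \ln H[\Phi(x)]$$

$$f'(x) = -\frac{1}{\ln 2} \frac{\Phi'(x)}{H[\Phi(x)]} \big[ \ln \Phi(x) - \ln(1 - \Phi(x)) \big]$$

$$
\begin{aligned}
f''(x) = {} & -\frac{1}{(\ln 2)^2} \frac{\Phi'(x)^2}{H[\Phi(x)]^2} \big[ \ln \Phi(x) - \ln(1 - \Phi(x)) \big]^2 \\
            & -\frac{1}{\ln 2} \frac{\Phi''(x)}{H[\Phi(x)]} \big[ \ln \Phi(x) - \ln(1 - \Phi(x)) \big] \\
            & -\frac{1}{\ln 2} \frac{\Phi'(x)^2}{H[\Phi(x)]} \left( \frac{1}{\Phi(x)} + \frac{1}{1 - \Phi(x)} \right)
\end{aligned}
$$

At $x = 0$ we have $\Phi(0) = 1/2$, $\Phi'(0) = 1/\sqrt{2\pi}$ and
$H[\Phi(0)] = 1$, so $f(0) = 0$, $f'(0) = 0$ and $f''(0) = -2 / (\pi \ln 2)$.

$$\therefore \; \ln H[\Phi(x)] = -\frac{x^2}{\pi \ln 2} + O(x^4)$$

Because the function is even, we can see by inspection that the $x^3$ term will
be zero. Therefore, exponentiating, we make the approximation up to $O(x^4)$:

$$H[\Phi(x)] \approx \exp\left( -\frac{x^2}{\pi \ln 2} \right)$$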



Preference Kernel 



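The mean $\mu_{\mathrm{pref}}$ and covariance function $k_{\mathrm{pref}}$ of
the GP over $g$ can be computed from the mean and covariance of
$f \sim \mathrm{GP}(\mu, k)$ as follows:

$$
\begin{aligned}
k_{\mathrm{pref}}([u_i, v_i], [u_j, v_j])
  &= \mathrm{Cov}[g(u_i, v_i), g(u_j, v_j)] \\
  &= \mathrm{Cov}\big[ (f(u_i) - f(v_i)), (f(u_j) - f(v_j)) \big] \\
  &= \mathbb{E}\big[ (f(u_i) - f(v_i)) \cdot (f(u_j) - f(v_j)) \big] - (\mu(u_i) - \mu(v_i)) (\mu(u_j) - \mu(v_j)) \\
  &= k(u_i, u_j) + k(v_i, v_j) - k(u_i, v_j) - k(v_i, u_j)
\end{aligned}
\qquad (9)
$$

$$\mu_{\mathrm{pref}}([u, v]) = \mathbb{E}[g([u, v])] = \mathbb{E}[f(u) - f(v)] = \mu(u) - \mu(v) \qquad (10)$$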